This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter. introduction:
who is interested where is data KM modeling
library(readxl)
#bd = read_excel('/Users/ritaliu/Desktop/Spring2021/Math_453/project/data_state.xlsx')
bd = read_excel('/Users/apple/Desktop/data_state.xlsx')
## New names:
## * `` -> ...1
bd[1] <- NULL
library(survminer)
## Loading required package: ggplot2
## Loading required package: ggpubr
library(survival)
#library(ggplot2)
#library(ggmap) # for mapping points on maps
#library(tidyverse) # for data cleaning and plotting
#library(maps) # for map data
#library(leaflet) # for highly customizable mapping
#bd%>%
#select(Time,Year,Quarter,State)%>%
#group_by(State,Year)%>%
#mutate(avgtime=mean(Time))%>%
#select(avgtime,Year,State)%>%
#top_n(10)%>%
#ggplot()+
#geom_line(aes(x=Year,y=avgtime), color = "steelblue")+
#facet_wrap(vars(State)) +
#labs(x = "",
#y = "",
#title = "Business Formation Time (in quarters)") +
#scale_fill_viridis_d() +
#theme(axis.text.y = element_blank())
bd = bd%>%
mutate(Education=`Education Attainment (At least some college) in count`)
bd = bd%>%
mutate(employment_rate=total_employment/(1000*Population_thousands))
bd = bd%>%
mutate(college_rate=Education/(1000*Population_thousands))
bd
bd$Status = 1
summary(survreg(Surv(Time,Status)~exp + tax_rev + nontax_rev + real_gdp + personal_income_per_capita + employment_rate + as.factor(Quarter)+ college_rate + Year ,dist='weibull',data=bd))
##
## Call:
## survreg(formula = Surv(Time, Status) ~ exp + tax_rev + nontax_rev +
## real_gdp + personal_income_per_capita + employment_rate +
## as.factor(Quarter) + college_rate + Year, data = bd, dist = "weibull")
## Value Std. Error z p
## (Intercept) -4.37e+01 2.80e+00 -15.60 < 2e-16
## exp -1.07e-05 4.28e-06 -2.50 0.01235
## tax_rev 2.35e-05 4.83e-06 4.86 1.2e-06
## nontax_rev 1.03e-05 7.30e-06 1.41 0.15816
## real_gdp -2.21e-07 6.51e-08 -3.39 0.00069
## personal_income_per_capita 2.89e-07 7.85e-08 3.68 0.00023
## employment_rate -8.83e-01 6.42e-02 -13.76 < 2e-16
## as.factor(Quarter)2 7.94e-02 7.20e-03 11.02 < 2e-16
## as.factor(Quarter)3 7.39e-02 7.21e-03 10.26 < 2e-16
## as.factor(Quarter)4 5.80e-02 7.22e-03 8.03 9.4e-16
## college_rate 1.01e+00 9.37e-02 10.75 < 2e-16
## Year 2.21e-02 1.39e-03 15.88 < 2e-16
## Log(scale) -2.35e+00 1.94e-02 -121.34 < 2e-16
##
## Scale= 0.0949
##
## Weibull distribution
## Loglik(model)= 487 Loglik(intercept only)= 76
## Chisq= 822 on 11 degrees of freedom, p= 3.6e-169
## Number of Newton-Raphson Iterations: 5
## n=1400 (742 observations deleted due to missingness)
summary(survreg(Surv(Time,Status)~ exp + tax_rev + employment_rate + real_gdp + personal_income_per_capita + as.factor(Quarter)+ college_rate + Year ,dist='weibull',data=bd))
##
## Call:
## survreg(formula = Surv(Time, Status) ~ exp + tax_rev + employment_rate +
## real_gdp + personal_income_per_capita + as.factor(Quarter) +
## college_rate + Year, data = bd, dist = "weibull")
## Value Std. Error z p
## (Intercept) -4.38e+01 2.79e+00 -15.69 < 2e-16
## exp -5.71e-06 2.42e-06 -2.36 0.018
## tax_rev 2.08e-05 4.36e-06 4.77 1.8e-06
## employment_rate -8.66e-01 6.26e-02 -13.82 < 2e-16
## real_gdp -2.45e-07 6.24e-08 -3.93 8.4e-05
## personal_income_per_capita 3.18e-07 7.54e-08 4.22 2.4e-05
## as.factor(Quarter)2 7.90e-02 7.21e-03 10.97 < 2e-16
## as.factor(Quarter)3 7.36e-02 7.21e-03 10.21 < 2e-16
## as.factor(Quarter)4 5.77e-02 7.23e-03 7.99 1.3e-15
## college_rate 9.68e-01 8.91e-02 10.86 < 2e-16
## Year 2.22e-02 1.39e-03 15.96 < 2e-16
## Log(scale) -2.35e+00 1.94e-02 -121.39 < 2e-16
##
## Scale= 0.095
##
## Weibull distribution
## Loglik(model)= 486 Loglik(intercept only)= 76
## Chisq= 819.98 on 10 degrees of freedom, p= 1e-169
## Number of Newton-Raphson Iterations: 5
## n=1400 (742 observations deleted due to missingness)
(12 - 486)*2
## [1] -948
(13 - 487)*2
## [1] -948
According to AIC, the model with all variables is same to the model without nontax_rev. If we were to use weibull model, we could choose the full model.
Explaination of parameters:
When we look at the p-value for each covariate, we notice that only nontax_revenue’s p-value(0.15816)>0.05, all the other covariate has p-value<0.05 so that the relationship between other covariates and the entrepreneurship time is significant.
We can also seperate those covariates into two group:
group 1 (positive coefficients):tax_rev, personal_income_per_capita, college_rate, Year.
So that in group 1, those covariates would make entrepreneurship formation time longer.
tax revenue’s effect (which in the government tax revenue) can be interpreted as if government is getting more tax revenue today, people’s incentive for forming a firm can be lower as they are worrying that more money would be taken by the government.
personal_income_per_capita’s effect can be interpreted as if people is alredy getting their target income, their incentive to start a firm can also be lowered as most people are not risk takers so that they would settle down with a steady income, so that their plans for forming a firm can be pushed back.
college_rate’s effect can be interpreted as if they already get a college degree, they have higher probability of getting a good job, unlike those who do not have a college degree, they are not on a all or nothing stage so that they may not be willing to give up those jobs that they already get and fully try to get the company started, they may consider more in tring to start a film, which may delay the formation time.
Year’s effect is a little bit tricky, it indicates that the formation time has been longer comparing to the earlier years, but we should also acknowledge that the big recession happened in 2008 and our data is from 2004-2014, so that the longer time can be due to the loss of trust in the economy.
group 2(negative coefficients):exp,real_gdp,employment_rate
So that in group 2, those covariates would make entrepreneurship formation time shorter.
exp(expenditure)’s effect can be interpreted as if people’s expenditure is increasing, their living cost can be increased, so that they are intending to find other ways to increase their income, thus they are tending to get the firm ready quicker.
real_gdp’s effect can be interpreted as people gain faith in the economy so that they would love to try out more possibilities.
?employment_rate’s effect can be interpreted as the whole economy is getting in a positive direction so that people may tend to start a company when the economy is great as a whole.
We also treat quarters as factors, and all the coefficients for quarter 2,3,4 are positive, which indicates that in those quarters, people tends to take longer time is firming a company, which can be the incentives to make a change is always the highest in the beginning of a year (quarter 1).
summary(survreg(Surv(Time,Status)~exp + tax_rev + nontax_rev + real_gdp + personal_income_per_capita + employment_rate + college_rate + as.factor(Quarter) + Year ,dist='lognormal',data=bd))
##
## Call:
## survreg(formula = Surv(Time, Status) ~ exp + tax_rev + nontax_rev +
## real_gdp + personal_income_per_capita + employment_rate +
## college_rate + as.factor(Quarter) + Year, data = bd, dist = "lognormal")
## Value Std. Error z p
## (Intercept) -5.25e+01 2.84e+00 -18.47 < 2e-16
## exp -8.62e-06 3.67e-06 -2.35 0.01879
## tax_rev 1.36e-05 3.80e-06 3.56 0.00037
## nontax_rev 1.10e-05 6.02e-06 1.82 0.06803
## real_gdp -1.25e-07 7.60e-08 -1.64 0.10094
## personal_income_per_capita 2.19e-07 9.32e-08 2.34 0.01910
## employment_rate -8.78e-01 5.43e-02 -16.16 < 2e-16
## college_rate 1.28e+00 9.26e-02 13.79 < 2e-16
## as.factor(Quarter)2 6.40e-02 7.24e-03 8.85 < 2e-16
## as.factor(Quarter)3 7.91e-02 7.24e-03 10.92 < 2e-16
## as.factor(Quarter)4 7.04e-02 7.25e-03 9.71 < 2e-16
## Year 2.65e-02 1.42e-03 18.70 < 2e-16
## Log(scale) -2.35e+00 1.89e-02 -124.17 < 2e-16
##
## Scale= 0.0957
##
## Log Normal distribution
## Loglik(model)= 601.4 Loglik(intercept only)= 85.7
## Chisq= 1031.45 on 11 degrees of freedom, p= 3.3e-214
## Number of Newton-Raphson Iterations: 5
## n=1400 (742 observations deleted due to missingness)
summary(survreg(Surv(Time,Status)~ exp + tax_rev + personal_income_per_capita + employment_rate + college_rate + as.factor(Quarter) + Year ,dist='lognormal',data=bd))
##
## Call:
## survreg(formula = Surv(Time, Status) ~ exp + tax_rev + personal_income_per_capita +
## employment_rate + college_rate + as.factor(Quarter) + Year,
## data = bd, dist = "lognormal")
## Value Std. Error z p
## (Intercept) -5.42e+01 2.65e+00 -20.44 < 2e-16
## exp -4.34e-06 2.49e-06 -1.74 0.08173
## tax_rev 1.34e-05 3.80e-06 3.53 0.00041
## personal_income_per_capita 6.34e-08 9.11e-09 6.96 3.5e-12
## employment_rate -8.86e-01 5.37e-02 -16.48 < 2e-16
## college_rate 1.23e+00 8.34e-02 14.78 < 2e-16
## as.factor(Quarter)2 6.44e-02 7.25e-03 8.88 < 2e-16
## as.factor(Quarter)3 7.95e-02 7.25e-03 10.96 < 2e-16
## as.factor(Quarter)4 7.11e-02 7.25e-03 9.81 < 2e-16
## Year 2.73e-02 1.32e-03 20.70 < 2e-16
## Log(scale) -2.34e+00 1.89e-02 -124.04 < 2e-16
##
## Scale= 0.0959
##
## Log Normal distribution
## Loglik(model)= 598 Loglik(intercept only)= 85.7
## Chisq= 1024.46 on 9 degrees of freedom, p= 9.1e-215
## Number of Newton-Raphson Iterations: 5
## n=1400 (742 observations deleted due to missingness)
(13 - 601.4)*2
## [1] -1176.8
(11 - 598 )*2
## [1] -1174
If we use lognormal model, we would choose to use the full model since the AIC for this model is lower.
Since loglikelihood for the lognormal model is larger than that of weibull model and two models (the full versions) have same amount of parameters, according to AIC, the Lognormal model is better. The covariates we would use are: exp, nontax_rev, tax_rev, real_gdp, personal_income_per_capita, employment_rate, eudcation_rate, quarter, and year.
Explaination of parameters for lognormal model:
When we look at the p-value for each covariate, we notice that only nontax_revenue’s p-value(0.06803)>0.05, all the other covariate has p-value<0.05 so that the relationship between other covariates and the entrepreneurship time is significant.
We can also seperate those covariates into two group:
group 1 (positive coefficients):tax_rev, personal_income_per_capita, college_rate, Year.
So that in group 1, those covariates would make entrepreneurship formation time longer.
tax revenue’s effect (which in the government tax revenue) can be interpreted as if government is getting more tax revenue today, people’s incentive for forming a firm can be lower as they are worrying that more money would be taken by the government.
personal_income_per_capita’s effect can be interpreted as if people is alredy getting their target income, their incentive to start a firm can also be lowered as most people are not risk takers so that they would settle down with a steady income, so that their plans for forming a firm can be pushed back.
college_rate’s effect can be interpreted as if they already get a college degree, they have higher probability of getting a good job, unlike those who do not have a college degree, they are not on a all or nothing stage so that they may not be willing to give up those jobs that they already get and fully try to get the company started, they may consider more in tring to start a film, which may delay the formation time.
Year’s effect is a little bit tricky, it indicates that the formation time has been longer comparing to the earlier years, but we should also acknowledge that the big recession happened in 2008 and our data is from 2004-2014, so that the longer time can be due to the loss of trust in the economy.
group 2(negative coefficients):exp,real_gdp,employment_rate
So that in group 2, those covariates would make entrepreneurship formation time shorter.
exp(expenditure)’s effect can be interpreted as if people’s expenditure is increasing, their living cost can be increased, so that they are intending to find other ways to increase their income, thus they are tending to get the firm ready quicker.
real_gdp’s effect can be interpreted as people gain faith in the economy so that they would love to try out more possibilities.
?employment_rate’s effect can be interpreted as the whole economy is getting in a positive direction so that people may tend to start a company when the economy is great as a whole.
We also treat quarters as factors, and all the coefficients for quarter 2,3,4 are positive, which indicates that in those quarters, people tends to take longer time is firming a company, which can be the incentives to make a change is always the highest in the beginning of a year (quarter 1).
data.frame(summary(bd))
# effect of exp/ year = 2008, quarter =1, other = mean
# min
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *5120 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2030 + 2.65e-02 *2008, 0.0957),xlim=c(1,2.5),ylab="Survival Proporton" , xlab="Formation Time(in quarters)")
# Q1
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *6802 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'red')
# Q2
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7521 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'blue')
# Q3
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *8560 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'purple')
# max
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *19586 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'orange')
# Q1
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *6802 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(1.1,2.1),ylim=c(0,1),col = 'red',ylab="Survival Proporton" , xlab="Formation Time(in quarters)",main="Business Formation Time (holding expenditure constant)")
# Q2
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7521 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(1.1,2.1),ylim=c(0,1),add = TRUE,col = 'blue')
# Q3
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *8560 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(1.1,2.1),ylim=c(0,1),add = TRUE,col = 'purple')
# effect of education
# min
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.1269 + 2.65e-02 *2008, 0.0957),xlim=c(1,2.5),ylab="Survival Proporton" , xlab="Formation Time(in quarters)")
# Q1
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.1729 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'red')
# Q2
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.1937 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'blue')
# Q3
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2235 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'purple')
# max
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.4231 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'orange')
# Q1
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.1729 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(1.1,2.1),ylim=c(0,1),ylab="Survival Proporton" , xlab="Formation Time(in quarters)",col = 'red',main="Business Formation Time (holding college rate constant)")
# Q2
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.1937 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(1.1,2.1),ylim=c(0,1),add = TRUE,col = 'blue')
# Q3
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2235 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(1.1,2.1),ylim=c(0,1),add = TRUE,col = 'purple')
# effect of tax/ year = 2008, quarter =1, other = mean
# min
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 2335 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2030 + 2.65e-02 *2008, 0.0957),xlim=c(1,2.5),ylab="Survival Proporton" , xlab="Formation Time(in quarters)")
# Q1
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 3364 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'red')
# Q2
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 3840 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'blue')
# Q3
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4540 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'purple')
# max
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 14609 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'orange')
# effect of real_gdp/ year = 2008, quarter =1, other = mean
# min
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 26614 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2030 + 2.65e-02 *2008, 0.0957),xlim=c(1,2.5),ylab="Survival Proporton" , xlab="Formation Time(in quarters)")
# Q1
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 73712 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'red')
# Q2
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 178486 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'blue')
# Q3
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 403734 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'purple')
# max
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 2355040 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'orange')
# effect of personal_income/ year = 2008, quarter =1, other = mean
# min
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *17998 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2030 + 2.65e-02 *2008, 0.0957),xlim=c(1,2.5),ylab="Survival Proporton" , xlab="Formation Time(in quarters)")
# Q1
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *59665 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'red')
# Q2
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *155226 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'blue')
# Q3
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *313185 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'purple')
# max
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *2075645 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'orange')
# effect of employment_rate/ year = 2008, quarter =1, other = mean
# min
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.4885 + 1.28e+00 * 0.2030 + 2.65e-02 *2008, 0.0957),xlim=c(1,2.5),ylab="Survival Proporton" , xlab="Formation Time(in quarters)")
# Q1
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.5685 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'red')
# Q2
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.5979 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'blue')
# Q3
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6478 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'purple')
# max
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 1.4399 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'orange')
# Q1
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.5685 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(1.1,2.1),ylim=c(0,1),ylab="Survival Proporton" , xlab="Formation Time(in quarters)",col = 'red',main="Business Formation Time (holding employment rate constant)")
# Q2
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.5979 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(1.1,2.1),ylim=c(0,1),add = TRUE,col = 'blue')
# Q3
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6478 + 1.28e+00 * 0.2030 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(1.1,2.1),ylim=c(0,1),add = TRUE,col = 'purple')
# effect of college_rate/ year = 2008, quarter =1, other = mean
# min
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.1269 + 2.65e-02 *2008, 0.0957),xlim=c(1,2.5),ylab="Survival Proporton" , xlab="Formation Time(in quarters)" )
# Q1
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.1729 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'red')
# Q2
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.1937 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'blue')
# Q3
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.2235 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'purple')
# max
curve (1 - plnorm(x, -5.25e+01 + -8.62e-06 *7907 + 1.36e-05 * 4160 + 1.10e-05 *2074 + -1.25e-07 * 308496 + 2.19e-07 *247566 + -8.78e-01 * 0.6218 + 1.28e+00 * 0.4231 + 2.65e-02 *2008,sd = 0.0957 ),xlim=c(0,2.5),ylim=c(0,1),add = TRUE,col = 'orange')
# we can use KM models to compare between Quarters
km = survfit( Surv(Time) ~ Quarter , data=bd )
plot(km,col=c("black","red","blue","purple"),ylab="Survival Proporton" , xlab="Formation Time(in quarters)", main="Business Formation Time by different Quarters")
Quarter 1, which is in black, has the lowest survival within all 4 quarters, and this would leads it to have the shortest formation time, while the other three quarters(Q2,Q3,Q4) are tending to have similiar and longer survival than Q1, which mateches with our model’s coefficients.
# didn't include nontax since if it is included, cox.mph would have error: system is computationally singular: reciprocal condition number = 2.04324e-16
m = coxph(Surv(Time,Status)~exp + tax_rev + real_gdp + personal_income_per_capita + employment_rate + college_rate + as.factor(Quarter) + Year,data = bd)
m
## Call:
## coxph(formula = Surv(Time, Status) ~ exp + tax_rev + real_gdp +
## personal_income_per_capita + employment_rate + college_rate +
## as.factor(Quarter) + Year, data = bd)
##
## coef exp(coef) se(coef) z p
## exp 5.759e-05 1.000e+00 2.548e-05 2.260 0.023808
## tax_rev -2.107e-04 9.998e-01 4.634e-05 -4.547 5.44e-06
## real_gdp 2.285e-06 1.000e+00 6.782e-07 3.370 0.000752
## personal_income_per_capita -3.010e-06 1.000e+00 8.187e-07 -3.676 0.000237
## employment_rate 9.080e+00 8.780e+03 6.828e-01 13.298 < 2e-16
## college_rate -1.008e+01 4.204e-05 9.432e-01 -10.684 < 2e-16
## as.factor(Quarter)2 -7.897e-01 4.540e-01 7.872e-02 -10.032 < 2e-16
## as.factor(Quarter)3 -7.642e-01 4.657e-01 7.704e-02 -9.920 < 2e-16
## as.factor(Quarter)4 -6.054e-01 5.458e-01 7.655e-02 -7.909 2.59e-15
## Year -2.302e-01 7.944e-01 1.512e-02 -15.220 < 2e-16
##
## Likelihood ratio test=776.6 on 10 df, p=< 2.2e-16
## n= 1400, number of events= 1400
## (742 observations deleted due to missingness)
cox.zph(m)
## chisq df p
## exp 3.4 1 0.065
## tax_rev 6.6 1 0.010
## real_gdp 163.8 1 < 2e-16
## personal_income_per_capita 157.5 1 < 2e-16
## employment_rate 18.7 1 1.5e-05
## college_rate 56.0 1 7.1e-14
## as.factor(Quarter) 38.5 3 2.2e-08
## Year 51.5 1 7.3e-13
## GLOBAL 314.9 10 < 2e-16
plot(cox.zph(m))
> From residual graph of year, we could see that the log(HR) is always lower than 1, meaning that it’s harder and harder to realize a bussiness as time passes but it is even harder at the beiginning years.